This notebook handles exploring, cleaning, and feature engineering for the credit card dataset.
It starts from the raw dataset and saves its output to a file named cleaned.csv.
This notebook was originally developed in Azure Machine Learning Studio against the Python 3.8 - AzureML kernel on a STANDARD_E4DS_V4 compute instance.
This is a binary classification problem: predict whether a credit card holder is likely to default in the future.
The training data set includes a binary target variable, default payment (Yes = 1, No = 0), and the following 23 feature variables:
data file: https://raw.githubusercontent.com/franklin-univ-data-science/data/master/credit_default.csv
%pip install numpy
%pip install pandas
# Set up Plotly express for visualization
%pip install plotly
import plotly.express as px
px.defaults.template = 'plotly_white'
px.defaults.color_continuous_scale = px.colors.sequential.Plasma
px.defaults.color_discrete_sequence = px.colors.qualitative.Vivid
%pip install seaborn
import seaborn as sns
import numpy as np
def plot_correlation(corr, x_size=25, y_size=18):
    """Display a diagonalized heatmap from a correlation matrix using seaborn."""
    # A diverging palette makes positive and negative correlations easy to tell apart
    cmap = sns.color_palette("seismic", as_cmap=True)
    # Mask the upper triangle so each feature pair appears only once
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    mask[np.diag_indices_from(mask)] = False  # Keep the 1.00 self-correlation diagonal; the chart looks empty without it
    # Create the heatmap
    fig = sns.heatmap(corr, mask=mask, cmap=cmap, center=0, linewidths=.5, annot=True, fmt='.2f',
                      vmin=-1, vmax=1, xticklabels=True, yticklabels=True)
    fig.set_xticklabels(fig.get_xticklabels(), rotation=45, horizontalalignment='right')
    fig.set_yticklabels(fig.get_yticklabels(), rotation=0, horizontalalignment='right')
    fig.axes.set_title('Feature Correlation')
    fig.figure.set_size_inches(x_size, y_size)
import pandas as pd
# Load credit history data
df = pd.read_csv('https://raw.githubusercontent.com/franklin-univ-data-science/data/master/credit_default.csv')
df.head()
| ID | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 25 columns
# The ID column has no predictive value, but it makes a good row index
df.set_index('ID', inplace=True)
df.head()
| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | ... | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 24 columns
# Get a sense of its scale
df.shape
(30000, 24)
# Rename columns so they make more sense to read and chart
df.rename(columns={'X1' : 'Credit Amount',
'X2' : 'Gender',
'X3' : 'Education',
'X4' : 'Marital Status',
'X5' : 'Age in Years',
'X6' : 'Repay Delay Sep',
'X7' : 'Repay Delay Aug',
'X8' : 'Repay Delay Jul',
'X9' : 'Repay Delay Jun',
'X10': 'Repay Delay May',
'X11': 'Repay Delay Apr',
'X12': 'Bill Sep',
'X13': 'Bill Aug',
'X14': 'Bill Jul',
'X15': 'Bill Jun',
'X16': 'Bill May',
'X17': 'Bill Apr',
'X18': 'Prior Pay Sep',
'X19': 'Prior Pay Aug',
'X20': 'Prior Pay Jul',
'X21': 'Prior Pay Jun',
'X22': 'Prior Pay May',
'X23': 'Prior Pay Apr',
'Y' : 'Defaulted'
},
inplace=True)
df.head()
| Credit Amount | Gender | Education | Marital Status | Age in Years | Repay Delay Sep | Repay Delay Aug | Repay Delay Jul | Repay Delay Jun | Repay Delay May | ... | Bill Jun | Bill May | Bill Apr | Prior Pay Sep | Prior Pay Aug | Prior Pay Jul | Prior Pay Jun | Prior Pay May | Prior Pay Apr | Defaulted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 24 columns
# Read the schema
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 30000 entries, 1 to 30000 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Credit Amount 30000 non-null int64 1 Gender 30000 non-null int64 2 Education 30000 non-null int64 3 Marital Status 30000 non-null int64 4 Age in Years 30000 non-null int64 5 Repay Delay Sep 30000 non-null int64 6 Repay Delay Aug 30000 non-null int64 7 Repay Delay Jul 30000 non-null int64 8 Repay Delay Jun 30000 non-null int64 9 Repay Delay May 30000 non-null int64 10 Repay Delay Apr 30000 non-null int64 11 Bill Sep 30000 non-null int64 12 Bill Aug 30000 non-null int64 13 Bill Jul 30000 non-null int64 14 Bill Jun 30000 non-null int64 15 Bill May 30000 non-null int64 16 Bill Apr 30000 non-null int64 17 Prior Pay Sep 30000 non-null int64 18 Prior Pay Aug 30000 non-null int64 19 Prior Pay Jul 30000 non-null int64 20 Prior Pay Jun 30000 non-null int64 21 Prior Pay May 30000 non-null int64 22 Prior Pay Apr 30000 non-null int64 23 Defaulted 30000 non-null int64 dtypes: int64(24) memory usage: 5.7 MB
Looks like there are no null values in any column, so no imputation is needed.
# Before we do anything more extreme, let's get descriptive statistics
df.describe()
| Credit Amount | Gender | Education | Marital Status | Age in Years | Repay Delay Sep | Repay Delay Aug | Repay Delay Jul | Repay Delay Jun | Repay Delay May | ... | Bill Jun | Bill May | Bill Apr | Prior Pay Sep | Prior Pay Aug | Prior Pay Jul | Prior Pay Jun | Prior Pay May | Prior Pay Apr | Defaulted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | ... | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 | 3.000000e+04 | 30000.00000 | 30000.000000 | 30000.000000 | 30000.000000 | 30000.000000 |
| mean | 167484.322667 | 1.603733 | 1.853133 | 1.551867 | 35.485500 | -0.016700 | -0.133767 | -0.166200 | -0.220667 | -0.266200 | ... | 43262.948967 | 40311.400967 | 38871.760400 | 5663.580500 | 5.921163e+03 | 5225.68150 | 4826.076867 | 4799.387633 | 5215.502567 | 0.221200 |
| std | 129747.661567 | 0.489129 | 0.790349 | 0.521970 | 9.217904 | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | ... | 64332.856134 | 60797.155770 | 59554.107537 | 16563.280354 | 2.304087e+04 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 | 0.415062 |
| min | 10000.000000 | 1.000000 | 0.000000 | 0.000000 | 21.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | -2.000000 | ... | -170000.000000 | -81334.000000 | -339603.000000 | 0.000000 | 0.000000e+00 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 50000.000000 | 1.000000 | 1.000000 | 1.000000 | 28.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | -1.000000 | ... | 2326.750000 | 1763.000000 | 1256.000000 | 1000.000000 | 8.330000e+02 | 390.00000 | 296.000000 | 252.500000 | 117.750000 | 0.000000 |
| 50% | 140000.000000 | 2.000000 | 2.000000 | 2.000000 | 34.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 19052.000000 | 18104.500000 | 17071.000000 | 2100.000000 | 2.009000e+03 | 1800.00000 | 1500.000000 | 1500.000000 | 1500.000000 | 0.000000 |
| 75% | 240000.000000 | 2.000000 | 2.000000 | 2.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 54506.000000 | 50190.500000 | 49198.250000 | 5006.000000 | 5.000000e+03 | 4505.00000 | 4013.250000 | 4031.500000 | 4000.000000 | 0.000000 |
| max | 1000000.000000 | 2.000000 | 6.000000 | 3.000000 | 79.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 | ... | 891586.000000 | 927171.000000 | 961664.000000 | 873552.000000 | 1.684259e+06 | 896040.00000 | 621000.000000 | 426529.000000 | 528666.000000 | 1.000000 |
8 rows × 24 columns
Let's make these columns a bit more relevant to what we want to do
# Map defaulted values of 1 to True and 0 to False
df['Defaulted'] = df['Defaulted'].eq(1)
df['Defaulted'].value_counts()
False 23364 True 6636 Name: Defaulted, dtype: int64
# And again, but normalized so we see percentages
df['Defaulted'].value_counts(normalize=True)
False 0.7788 True 0.2212 Name: Defaulted, dtype: float64
# Plot a histogram of Defaulted
px.histogram(df, x='Defaulted', color='Defaulted', title='Target Class Distribution')
We have a minor class imbalance: only about 22% of rows defaulted. That's not critical, but it's something to watch.
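If this imbalance matters during training, balanced class weights can be derived directly from these counts. A minimal sketch using the counts above and the same formula scikit-learn uses for `class_weight='balanced'` (applying weights downstream is our assumption, not something this notebook prescribes):

```python
# Class counts copied from the value_counts() output above
counts = {False: 23364, True: 6636}
n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" weighting: n_samples / (n_classes * count). The minority class
# (True) gets a weight above 1, the majority class a weight below 1.
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)
```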
# See what the gender data looks like
df['Gender'].value_counts()
2 18112 1 11888 Name: Gender, dtype: int64
Slight imbalance there. We know that 1 is male, so 2 presumably covers female / other / decline to state. Let's rename the column to a clearer boolean.
df['Gender'] = df['Gender'].eq(1)
df.rename(columns={'Gender': 'Is Male'}, inplace=True)
df['Is Male'].value_counts()
False 18112 True 11888 Name: Is Male, dtype: int64
# Plot a histogram of Is Male
px.histogram(df, x='Is Male', color='Defaulted', title='Gender Distribution')
# Let's look at education next. The requirements say "(1 = graduate school; 2 = university; 3 = high school; 4 = others)"
df['Education'].value_counts()
2 14030 1 10585 3 4917 5 280 4 123 6 51 0 14 Name: Education, dtype: int64
More undocumented values! We have 5, 6, and 0, which the requirements don't mention. Let's roll them into the others category (4).
# Map education values of 1 to "Graduate School", 2 to "University", 3 to "High School", and everything else to "Other"
def map_education(education):
    if education == 1:
        return 'Graduate School'
    elif education == 2:
        return 'University'
    elif education == 3:
        return 'High School'
    else:
        return 'Other Education'
df['Education'] = df['Education'].apply(map_education)
df['Education'].value_counts()
University 14030 Graduate School 10585 High School 4917 Other Education 468 Name: Education, dtype: int64
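The same bucketing can be written without a row-by-row Python function: `Series.map` with a dict leaves unmapped codes as NaN, and `fillna` then buckets them as Other. A sketch on sample codes, equivalent to `map_education` but in an alternative style:

```python
import pandas as pd

education_names = {1: 'Graduate School', 2: 'University', 3: 'High School'}
codes = pd.Series([1, 2, 3, 5, 0, 6])
# Unknown codes (5, 0, 6) fall through to NaN and get bucketed as Other
labels = codes.map(education_names).fillna('Other Education')
print(labels.tolist())
```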
# Plot a histogram of Education
px.histogram(df, x='Education', color='Defaulted', title='Education Distribution')
There are definite percentage differences across the educational buckets, and Other and High School may not have enough rows on their own. Let's introduce dummy columns for all of these values.
df = pd.get_dummies(df, columns=['Education'], prefix='', prefix_sep='')
df.head()
| Credit Amount | Is Male | Marital Status | Age in Years | Repay Delay Sep | Repay Delay Aug | Repay Delay Jul | Repay Delay Jun | Repay Delay May | Repay Delay Apr | ... | Prior Pay Aug | Prior Pay Jul | Prior Pay Jun | Prior Pay May | Prior Pay Apr | Defaulted | Graduate School | High School | Other Education | University | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 1 | 20000 | False | 1 | 24 | 2 | 2 | -1 | -1 | -2 | -2 | ... | 689 | 0 | 0 | 0 | 0 | True | 0 | 0 | 0 | 1 |
| 2 | 120000 | False | 2 | 26 | -1 | 2 | 0 | 0 | 0 | 2 | ... | 1000 | 1000 | 1000 | 0 | 2000 | True | 0 | 0 | 0 | 1 |
| 3 | 90000 | False | 2 | 34 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1500 | 1000 | 1000 | 1000 | 5000 | False | 0 | 0 | 0 | 1 |
| 4 | 50000 | False | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2019 | 1200 | 1100 | 1069 | 1000 | False | 0 | 0 | 0 | 1 |
| 5 | 50000 | True | 1 | 57 | -1 | 0 | -1 | 0 | 0 | 0 | ... | 36681 | 10000 | 9000 | 689 | 679 | False | 0 | 0 | 0 | 1 |
5 rows × 27 columns
# The requirements say "(1 = married; 2 = single; 3 = others)". Let's see if that's what we have
df['Marital Status'].value_counts()
2 15964 1 13659 3 323 0 54 Name: Marital Status, dtype: int64
# Map marital status values of 1 to "Married", 2 to "Single", and everything else to "Other"
def map_marital_status(status):
    if status == 1:
        return 'Married'
    elif status == 2:
        return 'Single'
    else:
        return 'Other Marital Status'
df['Marital Status'] = df['Marital Status'].apply(map_marital_status)
df['Marital Status'].value_counts()
Single 15964 Married 13659 Other Marital Status 377 Name: Marital Status, dtype: int64
# Plot a histogram of Marital Status
px.histogram(df, x='Marital Status', color='Defaulted', title='Marital Status Distribution')
Okay, so Other is nearly non-existent. That means married and single can likely be collapsed into a single boolean Is Married column. Correlation analysis supports this, so let's add the column and drop Marital Status.
df['Is Married'] = df['Marital Status'].eq('Married')
df.drop(columns=['Marital Status'], inplace=True)
df['Is Married'].value_counts()
False 16341 True 13659 Name: Is Married, dtype: int64
df['Repay Delay Sep'].value_counts()
0 14737 -1 5686 1 3688 -2 2759 2 2667 3 322 4 76 5 26 8 19 6 11 7 9 Name: Repay Delay Sep, dtype: int64
The repayment scale is odd, with -1 indicating 0 months of delay and 1+ indicating the number of months delayed. Calculations will be easier if we move those -1's to 0's.
Also note there's a -2 value, which the requirements don't mention. Let's assume it means the customer paid early and treat it as 0 months so our scale starts at 0.
Reasoning: we don't care how early they paid; we just care *if* they paid and didn't later default.
# Clamp all repayment-delay columns to a minimum of 0, collapsing the negative codes
for column in df.columns:
    if column.startswith('Repay Delay'):
        df[column] = df[column].clip(lower=0)
        fig = px.histogram(df, x=column, color='Defaulted', title=column + ' Distribution')
        fig.show()
Some of these columns invite additional engineered features.
It's probably helpful to have total columns indicating the sum of bill / prior pay / repay delay across all six months.
df['Bill Total'] = df['Bill Sep'] + df['Bill Aug'] + df['Bill Jul'] + df['Bill Jun'] + df['Bill May'] + df['Bill Apr']
df['Prior Pay Total'] = df['Prior Pay Sep'] + df['Prior Pay Aug'] + df['Prior Pay Jul'] + df['Prior Pay Jun'] + df['Prior Pay May'] + df['Prior Pay Apr']
df['Repay Delay Total'] = df['Repay Delay Sep'] + df['Repay Delay Aug'] + df['Repay Delay Jul'] + df['Repay Delay Jun'] + df['Repay Delay May'] + df['Repay Delay Apr']
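These explicit sums can also be written with `filter`/`sum`, which scales better if more months appear. A sketch on toy data whose column names mirror the notebook's:

```python
import pandas as pd

toy = pd.DataFrame({
    'Bill Sep': [100, 10], 'Bill Aug': [200, 20], 'Bill Jul': [300, 30],
    'Bill Jun': [0, 40], 'Bill May': [0, 50], 'Bill Apr': [0, 60],
})
# filter(like=...) selects columns whose names contain the substring;
# sum(axis=1) then adds across the six monthly columns per row
toy['Bill Total'] = toy.filter(like='Bill').sum(axis=1)
print(toy['Bill Total'].tolist())  # [600, 210]
```

One caveat: once a 'Bill Total' column exists, the substring match would include it on a re-run, so the explicit sums above are arguably safer in a notebook that may be executed out of order.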
df.head()
| Credit Amount | Is Male | Age in Years | Repay Delay Sep | Repay Delay Aug | Repay Delay Jul | Repay Delay Jun | Repay Delay May | Repay Delay Apr | Bill Sep | ... | Prior Pay Apr | Defaulted | Graduate School | High School | Other Education | University | Is Married | Bill Total | Prior Pay Total | Repay Delay Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 1 | 20000 | False | 24 | 2 | 2 | 0 | 0 | 0 | 0 | 3913 | ... | 0 | True | 0 | 0 | 0 | 1 | True | 7704 | 689 | 4 |
| 2 | 120000 | False | 26 | 0 | 2 | 0 | 0 | 0 | 2 | 2682 | ... | 2000 | True | 0 | 0 | 0 | 1 | False | 17077 | 5000 | 4 |
| 3 | 90000 | False | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | ... | 5000 | False | 0 | 0 | 0 | 1 | False | 101653 | 11018 | 0 |
| 4 | 50000 | False | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | ... | 1000 | False | 0 | 0 | 0 | 1 | True | 231334 | 8388 | 0 |
| 5 | 50000 | True | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 8617 | ... | 679 | False | 0 | 0 | 0 | 1 | True | 109339 | 59049 | 0 |
5 rows × 30 columns
px.histogram(df, x='Bill Total', color='Defaulted', title='Bill Total Distribution')
px.histogram(df, x='Prior Pay Total', color='Defaulted', title='Prior Pay Total Distribution')
px.histogram(df, x='Repay Delay Total', color='Defaulted', title='Repay Delay Total Distribution')
Interesting: customers with a total repay delay of exactly 1 are much less likely to default.
We probably don't care about the prior payment amount or the bill amount individually; we just care about their balance (bill - payment).
df['Balance Sep'] = df['Bill Sep'] - df['Prior Pay Sep']
df['Balance Aug'] = df['Bill Aug'] - df['Prior Pay Aug']
df['Balance Jul'] = df['Bill Jul'] - df['Prior Pay Jul']
df['Balance Jun'] = df['Bill Jun'] - df['Prior Pay Jun']
df['Balance May'] = df['Bill May'] - df['Prior Pay May']
df['Balance Apr'] = df['Bill Apr'] - df['Prior Pay Apr']
df['Balance Total'] = df['Bill Total'] - df['Prior Pay Total']
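The six per-month balance assignments all follow one pattern, so they can be generated from a list of month suffixes; a minimal sketch on toy data:

```python
import pandas as pd

months = ['Sep', 'Aug', 'Jul', 'Jun', 'May', 'Apr']
toy = pd.DataFrame({
    **{f'Bill {m}': [100, 50] for m in months},
    **{f'Prior Pay {m}': [40, 50] for m in months},
})
# Each month's balance is that month's bill minus that month's prior payment
for m in months:
    toy[f'Balance {m}'] = toy[f'Bill {m}'] - toy[f'Prior Pay {m}']
print(toy['Balance Sep'].tolist())  # [60, 0]
```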
df.head()
| Credit Amount | Is Male | Age in Years | Repay Delay Sep | Repay Delay Aug | Repay Delay Jul | Repay Delay Jun | Repay Delay May | Repay Delay Apr | Bill Sep | ... | Bill Total | Prior Pay Total | Repay Delay Total | Balance Sep | Balance Aug | Balance Jul | Balance Jun | Balance May | Balance Apr | Balance Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | |||||||||||||||||||||
| 1 | 20000 | False | 24 | 2 | 2 | 0 | 0 | 0 | 0 | 3913 | ... | 7704 | 689 | 4 | 3913 | 2413 | 689 | 0 | 0 | 0 | 7015 |
| 2 | 120000 | False | 26 | 0 | 2 | 0 | 0 | 0 | 2 | 2682 | ... | 17077 | 5000 | 4 | 2682 | 725 | 1682 | 2272 | 3455 | 1261 | 12077 |
| 3 | 90000 | False | 34 | 0 | 0 | 0 | 0 | 0 | 0 | 29239 | ... | 101653 | 11018 | 0 | 27721 | 12527 | 12559 | 13331 | 13948 | 10549 | 90635 |
| 4 | 50000 | False | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 46990 | ... | 231334 | 8388 | 0 | 44990 | 46214 | 48091 | 27214 | 27890 | 28547 | 222946 |
| 5 | 50000 | True | 57 | 0 | 0 | 0 | 0 | 0 | 0 | 8617 | ... | 109339 | 59049 | 0 | 6617 | -31011 | 25835 | 11940 | 18457 | 18452 | 50290 |
5 rows × 37 columns
# Show histograms of all balance columns
for column in df.columns:
if column.startswith('Balance'):
fig = px.histogram(df, x=column, color='Defaulted', title=column + ' Distribution')
fig.show()
fig = px.scatter_3d(df, x='Balance Total', z='Prior Pay Total', y='Bill Total', color='Defaulted', title='Balance Relationships')
# Make the figure tall so we can see it
fig.update_layout(height=600)
fig = px.scatter_3d(df, x='Balance Total', z='Prior Pay Total', y='Credit Amount', color='Defaulted', title='Balance Relationships with Credit Amount')
# Make the figure tall so we can see it
fig.update_layout(height=600)
fig = px.scatter_3d(df, x='Credit Amount', z='Prior Pay Total', y='Repay Delay Total', color='Defaulted', title='Credit, Prior Pay, and Repay Delay Relationships')
# Make the figure tall so we can see it
fig.update_layout(height=600)
fig = px.scatter_3d(df, x='Credit Amount', z='Prior Pay Total', y='Repay Delay Total', color='Balance Total', title='Credit, Prior Pay, and Repay Delay Colored by Balance Total')
# Make the figure tall so we can see it
fig.update_layout(height=600)
px.histogram(df, x='Credit Amount', title='Credit Amount Distribution', color='Defaulted', marginal='box')
px.histogram(df, x='Prior Pay Total', title='Prior Pay Distribution', color='Defaulted', marginal='box')
px.histogram(df, x='Repay Delay Total', title='Repay Delay Distribution', color='Defaulted', marginal='box')
Based on this, we should probably drop rows with a Prior Pay Total above $100k or a Repay Delay Total above 5. Removing these points keeps extreme values from dominating model training and clarifies the effective range of our model.
We clearly have some outliers. They will skew both model training and correlation analysis, so let's remove them now.
# Create a new filtered dataframe that removes the identified outliers
df = df[(df['Prior Pay Total'] <= 100000) & (df['Repay Delay Total'] <= 5)]
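The cutoffs above were eyeballed from the histograms. A more systematic (though not necessarily better) alternative is a Tukey/IQR fence per column; a sketch, where k=1.5 is the conventional but still arbitrary multiplier:

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([1, 2, 3, 4, 100])
lo, hi = iqr_bounds(s)
kept = s[(s >= lo) & (s <= hi)]
print(kept.tolist())  # [1, 2, 3, 4]
```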
df.describe()
| Credit Amount | Age in Years | Repay Delay Sep | Repay Delay Aug | Repay Delay Jul | Repay Delay Jun | Repay Delay May | Repay Delay Apr | Bill Sep | Bill Aug | ... | Bill Total | Prior Pay Total | Repay Delay Total | Balance Sep | Balance Aug | Balance Jul | Balance Jun | Balance May | Balance Apr | Balance Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | ... | 2.473800e+04 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 24738.000000 | 2.473800e+04 |
| mean | 166257.970733 | 35.436211 | 0.202967 | 0.129477 | 0.103646 | 0.066942 | 0.049721 | 0.074177 | 48214.109588 | 45427.235185 | ... | 2.426526e+05 | 21360.925055 | 0.626930 | 44169.446398 | 41483.419314 | 38818.686191 | 34859.830827 | 31757.426186 | 30202.898294 | 2.212917e+05 |
| std | 124821.754166 | 9.210237 | 0.510864 | 0.494391 | 0.445713 | 0.360742 | 0.312572 | 0.381638 | 69199.590220 | 66268.583243 | ... | 3.509089e+05 | 21019.749727 | 1.223993 | 67712.457447 | 64904.621849 | 61814.285509 | 56851.045257 | 53138.152046 | 51886.413080 | 3.415802e+05 |
| min | 10000.000000 | 21.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -11545.000000 | -69777.000000 | ... | -2.595230e+05 | 0.000000 | 0.000000 | -89921.000000 | -139777.000000 | -90676.000000 | -90671.000000 | -89273.000000 | -122246.000000 | -2.599780e+05 |
| 25% | 60000.000000 | 28.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2551.000000 | 2000.000000 | ... | 2.072225e+04 | 6505.000000 | 0.000000 | 384.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.401250e+03 |
| 50% | 140000.000000 | 34.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 19841.500000 | 18790.000000 | ... | 1.048005e+05 | 13837.500000 | 0.000000 | 17205.500000 | 16343.500000 | 15540.000000 | 13954.000000 | 11531.000000 | 9816.000000 | 8.677200e+04 |
| 75% | 230000.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 63569.250000 | 59516.750000 | ... | 3.000275e+05 | 29904.750000 | 1.000000 | 60135.500000 | 56063.250000 | 51009.250000 | 46558.750000 | 42179.250000 | 39727.250000 | 2.805608e+05 |
| max | 800000.000000 | 79.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 746814.000000 | 605943.000000 | ... | 2.862678e+06 | 100000.000000 | 5.000000 | 726314.000000 | 591943.000000 | 663643.000000 | 470005.000000 | 535865.000000 | 679638.000000 | 2.786678e+06 |
8 rows × 34 columns
plot_correlation(df.corr())
This is pretty busy, but the bill columns immediately stand out as not significantly correlated with Defaulted, so let's drop them.
df.drop(columns=['Bill Sep', 'Bill Aug', 'Bill Jul', 'Bill Jun', 'Bill May', 'Bill Apr', 'Bill Total'], inplace=True)
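Rather than reading the heatmap by eye, weakly correlated features can be short-listed programmatically. A sketch on toy data; the 0.2 threshold is an arbitrary assumption, and low linear correlation alone isn't proof a feature is useless:

```python
import pandas as pd

toy = pd.DataFrame({
    'Defaulted': [1, 0, 1, 0, 1, 0],
    'Relevant':  [9, 1, 8, 2, 9, 1],
    'Noise':     [5, 5, 4, 5, 5, 4],
})
# Absolute correlation of every feature with the target
corr_with_target = toy.corr()['Defaulted'].abs().drop('Defaulted')
# Candidates to drop: features below the (arbitrary) threshold
weak = corr_with_target[corr_with_target < 0.2].index.tolist()
print(weak)  # ['Noise']
```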
plot_correlation(df.corr())
Looking more closely, the balance columns don't correlate much with Defaulted and tend to be correlated with the prior payment columns and credit amount, so let's drop them too.
df.drop(columns=['Balance Sep', 'Balance Aug', 'Balance Jul', 'Balance Jun', 'Balance May', 'Balance Apr', 'Balance Total'], inplace=True)
plot_correlation(df.corr(), x_size=18, y_size=12)
# The Other Education dummy carries almost no signal, so let's drop it
df.drop(columns=['Other Education'], inplace=True)
# Prior Pay June through April aren't very relevant, so let's drop them. We'll keep September and August as well as the total
df.drop(columns=['Prior Pay Jun', 'Prior Pay May', 'Prior Pay Apr'], inplace=True)
plot_correlation(df.corr(), x_size=16, y_size=10)
Is Male, Is Married, and Age in Years aren't very predictive, but they could reveal bias in the model. Let's keep them for bias detection, but we shouldn't feed them to the model.
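One way a later notebook could honor this separation (a sketch; the helper and constant names are ours, not part of the project):

```python
import pandas as pd

# Columns retained only for bias auditing; never fed to the model
AUDIT_COLUMNS = ['Is Male', 'Is Married', 'Age in Years']

def split_model_and_audit(df: pd.DataFrame):
    """Separate modeling features from bias-audit columns."""
    audit_cols = [c for c in AUDIT_COLUMNS if c in df.columns]
    return df.drop(columns=audit_cols), df[audit_cols]

toy = pd.DataFrame({'Is Male': [True], 'Age in Years': [30], 'Credit Amount': [20000]})
model, audit = split_model_and_audit(toy)
print(list(model.columns), list(audit.columns))
```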
# Let's examine our remaining educational features and their correlations to defaulted
corr = df[['Defaulted', 'Credit Amount', 'High School', 'University', 'Graduate School']].corr()
plot_correlation(corr, x_size=4, y_size=3)
It seems education may largely be a proxy for credit amount. Let's represent education with just the Graduate School flag, since that single 0/1 captures most of the impact on credit and on defaulting.
df.drop(columns=['High School', 'University'], inplace=True)
plot_correlation(df.corr(), x_size=12, y_size=8)
# Let's drop Repay Delays past July to simplify things since they're not very significant and represented in the total
df.drop(columns=['Repay Delay Jun', 'Repay Delay May', 'Repay Delay Apr'], inplace=True)
plot_correlation(df.corr(), x_size=12, y_size=8)
This may still be more than we need, but we'll say these are our final features. Let's see the correlations without the bias detection columns:
plot_correlation(df.drop(columns=['Is Male', 'Age in Years', 'Is Married']).corr(), x_size=9, y_size=6) # Doesn't actually remove the columns from df
Now that we've dropped the non-relevant bits, let's save the data to a CSV file called cleaned.csv.
# Save the file to disk at cleaned.csv, ensuring the index is also persisted
df.to_csv('cleaned.csv', index=True)
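Since TrainTestSplit.ipynb depends on this file, a quick round-trip check confirms the index survives the save. A sketch using a temporary file; the downstream notebook would need to read with `index_col='ID'`:

```python
import os
import tempfile

import pandas as pd

# Tiny frame with the same kind of named index the notebook uses
sample = pd.DataFrame({'Credit Amount': [20000, 120000]},
                      index=pd.Index([1, 2], name='ID'))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cleaned.csv')
    sample.to_csv(path, index=True)
    reloaded = pd.read_csv(path, index_col='ID')
print(reloaded.equals(sample))  # True
```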
Work continues in TrainTestSplit.ipynb